Optimized ingestion settings ensure your data is prepared for efficient search and retrieval by the LLM. You can adjust these settings to optimize data ingestion based on your specific use case and document type.
Knowledge Graph Extraction
Knowledge Graph Extraction is an AI-powered feature that identifies and extracts key entities from your ingested data chunks and stores them in a graph database. The resulting Knowledge Graph records explicit relationships between entities and the chunks from which they were extracted. This enriches the context available during search, so retrieval can return more relevant chunks based on entity connections. A reranker further improves the accuracy and relevance of retrieved chunks.
For detailed guidance on configuring industry presets, custom entity types, the two extraction modes, file status lifecycle, and cost optimization, see Knowledge Graph Extraction.
⚠️ Warning: Knowledge Graph Extraction involves AI processing, which incurs associated costs. These costs, including reranker costs during retrieval, are tracked in your Token consumption feed.
Enable Knowledge Graph Extraction
To leverage Knowledge Graph Extraction, enable it when creating a new data source.
- Navigate to the data source creation interface.
- Locate the Knowledge Graph Extraction option.
- Toggle Knowledge Graph Extraction to enable it.
- Proceed with creating your data source.
View the Knowledge Graph
You can visualize the extracted Knowledge Graph for any data source where the feature is enabled.
- From your data source list, locate the desired data source.
- Expand the menu for that data source.
- Select the View Graph option.
⚠️ Warning: The ability to view the graph is not supported for data sources with Original source permissions enabled. This is to ensure that entities and chunks of data are not accessed by users without the necessary permissions. For more details, see Original Source Permissions.
Retrieval with Knowledge Graph Extraction
When a data source has Knowledge Graph Extraction enabled, the retrieval process is significantly enhanced:
- Semantic Search: Initial chunks are retrieved based on semantic similarity to your query.
- Entity Expansion: Retrieved chunks are enriched by adding their associated entities and any other related chunks from the Knowledge Graph, significantly broadening the context beyond initial semantic matches.
- Reranking: All expanded results are then reranked to ensure the highest accuracy and relevance.
- Top K Results: The system returns the Top K results you have configured, along with their neighboring chunks as specified in your configuration.
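The four retrieval stages above can be sketched as a small pipeline. This is an illustrative toy under stated assumptions, not Airia's implementation: term overlap stands in for vector similarity and for the reranker, and the names `Chunk`, `semantic_search`, `expand_by_entities`, `rerank`, `retrieve`, and `CHUNKS` are all hypothetical. Neighboring-chunk retrieval is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    entities: frozenset  # entity names extracted from this chunk


def overlap(query_terms, chunk):
    """Toy relevance score: number of query terms present in the chunk text."""
    return len(query_terms & set(chunk.text.lower().split()))


def semantic_search(query_terms, chunks, limit):
    """Stand-in for vector search: keep the best-matching chunks."""
    ranked = sorted(chunks, key=lambda c: overlap(query_terms, c), reverse=True)
    return [c for c in ranked if overlap(query_terms, c) > 0][:limit]


def expand_by_entities(seed_chunks, all_chunks):
    """Add every chunk that shares at least one entity with a seed chunk."""
    seed_entities = set()
    for chunk in seed_chunks:
        seed_entities |= chunk.entities
    expanded = list(seed_chunks)
    for chunk in all_chunks:
        if chunk not in expanded and chunk.entities & seed_entities:
            expanded.append(chunk)
    return expanded


def rerank(query_terms, candidates):
    """Stand-in reranker: re-score every expanded candidate against the query."""
    return sorted(candidates, key=lambda c: overlap(query_terms, c), reverse=True)


def retrieve(query, chunks, top_k=2, initial_limit=1):
    terms = set(query.lower().split())
    seeds = semantic_search(terms, chunks, initial_limit)  # 1. semantic search
    expanded = expand_by_entities(seeds, chunks)           # 2. entity expansion
    return rerank(terms, expanded)[:top_k]                 # 3. rerank, 4. top K


CHUNKS = [
    Chunk("c1", "acme signed the supply contract in march",
          frozenset({"Acme", "Supply Contract"})),
    Chunk("c2", "the agreement includes a penalty clause for late delivery",
          frozenset({"Supply Contract"})),
    Chunk("c3", "quarterly weather report for the region",
          frozenset({"Weather"})),
]

results = retrieve("acme contract penalty", CHUNKS, top_k=2, initial_limit=1)
print([c.chunk_id for c in results])
```

In this toy data set, chunk `c2` is not part of the initial semantic result because `initial_limit` is 1, but it is pulled in during expansion because it shares the "Supply Contract" entity with `c1`.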
Scan Document for Images
This feature allows the system to generate descriptions for images found within your documents, making image content discoverable through search.
💡 Note: This feature is enabled by default.
An OCR (Optical Character Recognition) solution is used to extract text from images. This extracted text, along with generated image descriptions, enhances search capabilities by indexing visual content.
Select PDF Parser
💡 Note: This feature is currently available to selected customers who are granted early access. Please contact your sales representative if you wish to also receive early access. Capabilities and pricing for parsers are subject to change.
Airia offers several PDF parsers to extract content from your documents. Selecting the right one for your content type significantly improves extraction quality and downstream search results.
⚠️ No parser guarantees 100% accuracy. PDF content varies enormously — scan quality, layout density, language, formulas, checkboxes, and page size all affect extraction. Before ingesting a full data set, we strongly recommend running a small pilot with a handful of your most challenging files, comparing retrieval quality across parsers, and then committing to one. Switching parsers later only applies to newly added or updated files; to re-process existing files with a different parser, you must create a new data source.
Available Parsers
- Basic: Default option. Optimized for simple, born-digital PDFs with selectable text and repetitive simple layouts. The Basic parser itself does not extract image content; instead, every embedded image is forwarded to a vision model when Scan Document for Images is enabled (the default). For PDFs that are image-heavy or fully scanned, prefer Advanced or Universal — they extract most image content directly within the parser, which is faster and more reliable.
- Advanced: Best for content-rich documents — mathematical expressions, formulas, dense tables, multilingual text (including non-Latin scripts), and handwritten notes. Strong on technical, scientific, and academic material.
- Universal: Best for forms with checkboxes, wide-format pages (catalogs, spec sheets, engineering drawings, fold-out diagrams), scanned forms, and documents where field-level layout fidelity matters.
- Intelligent: Highest extraction quality and the most flexible. Uses a PDF Parser Agent powered by an LLM, with a customizable prompt that lets you tailor extraction to your specific use case (formatting rules, structured output, content prioritization). Best when extraction quality is the top priority or when you need a custom output format. Slower and more resource-intensive than the other parsers — see the section below for details.
Choosing a Parser
| Your documents… | Recommended parser |
|---|---|
| Are born-digital PDFs with selectable text and simple layouts | Basic |
| Contain math, formulas, dense tables, multilingual text, or handwriting | Advanced |
| Contain forms with checkboxes, wide-format pages, or scanned forms where layout matters | Universal |
| Are mixed or unknown, where extraction quality is the top priority — or you need a custom output format | Intelligent |
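The table above can be read as a simple decision rule. The sketch below encodes it in Python for illustration only. The trait labels and the `recommend_parser` function are hypothetical, and treating a document that needs both Advanced and Universal strengths as "mixed" is an interpretation of the table, not a documented rule.

```python
# Hypothetical trait labels, grouped by the table row they correspond to.
ADVANCED_TRAITS = {"math", "formulas", "dense_tables", "multilingual", "handwriting"}
UNIVERSAL_TRAITS = {"checkbox_forms", "wide_format", "scanned_forms"}
INTELLIGENT_TRAITS = {"unknown", "quality_first", "custom_output"}


def recommend_parser(traits):
    """Map a collection of document traits to a parser name from the table."""
    traits = set(traits)
    needs_advanced = bool(traits & ADVANCED_TRAITS)
    needs_universal = bool(traits & UNIVERSAL_TRAITS)

    # Mixed or unknown content, or a custom output format: Intelligent.
    if traits & INTELLIGENT_TRAITS or (needs_advanced and needs_universal):
        return "Intelligent"
    if needs_universal:
        return "Universal"
    if needs_advanced:
        return "Advanced"
    # Born-digital PDFs with selectable text and simple layouts.
    return "Basic"


print(recommend_parser({"formulas", "multilingual"}))
print(recommend_parser({"wide_format"}))
print(recommend_parser({"formulas", "checkbox_forms"}))
```

A rule like this is only a starting point; the pilot on representative files remains the deciding step.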
Known Limitations
Each parser has trade-offs that may affect your choice:
- Basic does not run page-level OCR, and it does not extract image content directly. Image content is handled by the Scan Document for Images vision-model path, which processes images sequentially and is time-bounded per image. As a result, image-heavy or fully scanned PDFs ingest significantly more slowly under Basic, and some images may not be processed successfully within the retry budget. Advanced and Universal avoid this bottleneck because they extract the majority of image content within the parser itself.
- Advanced can struggle with very wide-format pages (oversized catalogs, fold-out sheets), which are sometimes returned as a single image with content loss. Multi-choice checkboxes can be misread, and charts are typically returned as images rather than transcribed.
- Universal can corrupt or fail to render mathematical formulas correctly.
- Intelligent is slower and more resource-intensive — each page triggers a separate Agent execution, and overall quality depends on the chosen LLM.
Recommended Pilot Workflow
Before ingesting a full data set, validate your parser choice on representative files:
- Select a handful of files (5–10 is usually enough), weighted toward your hardest documents: scanned, wide-format, formula-heavy, multilingual, or form-based.
- Create a temporary data source for each parser candidate.
- Ingest the same files into each.
- Run 3–5 questions you expect end users to actually ask.
- Choose the parser that performs best on your hardest cases — not on the average case.
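One way to make the last step concrete is to score each parser per file (for example, the fraction of pilot questions answered correctly) and rank parsers by their worst file rather than their average. The sketch below assumes that reading of "hardest cases". The scores, file names, and the `pick_parser` function are hypothetical.

```python
from statistics import mean


def pick_parser(scores):
    """Pick the parser with the best worst-case file score.

    scores maps parser name -> {file name: score in [0, 1]}.
    The mean is used only to break ties between equal worst-case scores.
    """
    return max(
        scores,
        key=lambda parser: (min(scores[parser].values()), mean(scores[parser].values())),
    )


# Hypothetical pilot results, graded by hand.
pilot_scores = {
    "Advanced": {"scan.pdf": 1.0, "catalog.pdf": 0.4, "paper.pdf": 1.0},
    "Universal": {"scan.pdf": 0.8, "catalog.pdf": 0.8, "paper.pdf": 0.6},
}

best_on_average = max(pilot_scores, key=lambda p: mean(pilot_scores[p].values()))
print("Best on average:", best_on_average)
print("Best on hardest case:", pick_parser(pilot_scores))
```

With these made-up numbers, Advanced wins on average but loses badly on the wide-format catalog, so the worst-case rule selects Universal.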
How Intelligent Parser Works
- When Intelligent Parser is selected, a PDF Parser Agent is automatically created in the Data Ingest project.
- The Agent processes the PDF page by page.
- A separate Agent execution is triggered for every page of the document.
- The output is high-quality, structured extracted text optimized for downstream usage.
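Because every page triggers its own Agent execution, the work done by the Intelligent parser scales linearly with page count. The sketch below is a back-of-the-envelope estimate under that assumption. The tokens-per-page and price figures are placeholders you would replace with values observed in your own Token Consumption Feed, and `estimate_intelligent_parser_usage` is a hypothetical helper.

```python
def estimate_intelligent_parser_usage(page_count, tokens_per_page, price_per_1k_tokens):
    """Estimate executions, tokens, and cost for one document.

    tokens_per_page and price_per_1k_tokens are assumptions supplied by the
    caller; they are not published figures.
    """
    executions = page_count  # one Agent execution per page
    total_tokens = page_count * tokens_per_page
    cost = total_tokens / 1000 * price_per_1k_tokens
    return executions, total_tokens, cost


executions, tokens, cost = estimate_intelligent_parser_usage(
    page_count=120, tokens_per_page=1500, price_per_1k_tokens=0.01
)
print(f"{executions} executions, {tokens} tokens, estimated cost {cost:.2f}")
```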
Prompt Customization
The Intelligent Parser uses a configurable prompt to guide how text is extracted and structured.
- The prompt can be modified directly in the interface.
- Customizations can be tailored to specific use cases, such as:
  - Color coding interpretation
  - Formatting rules
  - Structured output requirements
  - Content prioritization
💡 Note: Prompt customization applies only to the specific data store or Agent configuration being edited.
Monitoring Execution & Token Usage
💡 Note: Since the Intelligent Parser uses an LLM, token consumption is associated with each page execution.
You can monitor activity through:
- Agent Execution Feed – View execution history and status.
- Token Consumption Feed – Track token usage and associated costs.
Changing the LLM Model
The underlying LLM can be swapped depending on performance, cost, or quality requirements.
Important: Changing the LLM model is a global update and will apply to all data sources and Agents that use that model.
To change the LLM:
- Navigate to the Data Ingest project
- Edit the PDF Parser Agent
- Update the selected LLM / model
- Save the changes
Edit the Selected Parser
You can change the PDF parser for your data source. Go to the option menu next to your data source and click Edit. From the edit screen, select a new parser. This new parser will be applied to all newly added or updated files within the data source after sync. To apply the new parser to all existing files, you must create a new data source with the desired parser setting.
Configure Text-to-SQL for Structured Data
Text-to-SQL allows you to interact with your structured data (specifically .csv and .xlsx files) using natural language queries, which are then translated into SQL.
When to Use Text-to-SQL
Use Text-to-SQL when you need to ask precise, quantitative questions about your structured data, such as:
- “What is the revenue generated by product A for the year to date?”
- “How many leads have we generated for the last year?”
How to Use Text-to-SQL
1. Set Up Your Data Source
Begin by setting up your data source with the relevant .csv or .xlsx files. The data source can also contain other file types.
2. Activate SQL Indexing
In the Ingestion settings for your data source, activate the SQL indexing option. For your .csv/.xlsx files, choose one of the following:
- Semantic: When selected, only vectors are generated for the structured files. This enables text search based on meaning and context. Choose this for semi-structured tabular data where natural language understanding is key.
  💡 Example: For a survey documented in an Excel file with open-ended customer answers, use Semantic. Question: “What are the common complaints customers have about Agent Builder?”
- SQL Only: When selected, the file is indexed as SQL only, without enabling semantic search. Choose this for highly structured data where precise, quantitative answers are expected.
  💡 Example: Question: “How many complaints are registered as High priority?”
- Both: When selected, both vectors and SQL indexes are generated for the structured files. This can enhance retrieval accuracy, at the expense of speed and cost due to dual retrieval.
💡 Note: Both is the default option for the Text-to-SQL setting. For all other file types within the same data source, only vector embeddings (semantic search) will be generated.
3. Checking Ingestion Status for Structured Files
For .csv/.xlsx files, you can monitor their ingestion status directly within the data source view. The status indicates the success of both SQL and vector indexing:
- Ready: Both the SQL index and vector embeddings have been successfully created.
- Failed: Both the SQL index and vector embeddings have failed to be created. You can check the reason for failure in the Failed files logs (indicated by a red button at the top of the page).
- Partial: One of the two indexes (either SQL or vector) has failed, while the other was successful. Hover over the “Partial” status to see which specific index is ready and which has failed. The reason for the failed index can also be found in the Failed files logs.
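The three statuses above follow directly from the outcome of the two indexing jobs. The sketch below spells out that mapping for a file where both indexes are requested. The function names and boolean inputs are hypothetical and do not correspond to a documented API.

```python
def ingestion_status(sql_index_ok, vector_index_ok):
    """Derive the displayed status from the two indexing outcomes."""
    if sql_index_ok and vector_index_ok:
        return "Ready"
    if not sql_index_ok and not vector_index_ok:
        return "Failed"
    return "Partial"  # exactly one of the two indexes succeeded


def failed_indexes(sql_index_ok, vector_index_ok):
    """List which indexes failed, mirroring what the hover text reports."""
    failed = []
    if not sql_index_ok:
        failed.append("SQL")
    if not vector_index_ok:
        failed.append("vector")
    return failed


print(ingestion_status(True, False), failed_indexes(True, False))
```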
4. Use in the Agent
In your Agent’s workflow, activate the Text-to-SQL retrieval option in the Data Source step. By default, this option is disabled, and the Data Source relies on Semantic retrieval. Enabling Text-to-SQL search will specifically query through .csv and .xlsx files from the connected Data Source.
💡 Example: To retrieve all sales records from an Excel file where sales exceed $5,000 and the date is within Q1 2025, a SQL query like SELECT * FROM sales WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' provides an efficient and precise solution by leveraging the file’s structured format.
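To see how such a filter behaves, the sketch below runs a Q1 2025 query against an in-memory SQLite table using Python's standard `sqlite3` module. The `sales` table layout and the sample rows are assumptions made for illustration; the SQL that the LLM generates for your files may differ. A half-open date range is used so that January, February, and March are all covered, which works here because the dates are stored as ISO 8601 text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, date TEXT)")
conn.executemany(
    "INSERT INTO sales (amount, date) VALUES (?, ?)",
    [
        (7200.0, "2025-01-15"),
        (4800.0, "2025-02-03"),  # below the 5,000 threshold
        (9100.0, "2025-03-28"),
        (6500.0, "2025-04-02"),  # outside Q1
    ],
)

# ISO 8601 date strings sort chronologically, so plain comparisons work.
rows = conn.execute(
    "SELECT amount, date FROM sales "
    "WHERE amount > 5000 AND date >= '2025-01-01' AND date < '2025-04-01' "
    "ORDER BY date"
).fetchall()

for amount, date in rows:
    print(date, amount)
```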
💡 Hint: If you want to enable both Semantic and SQL search types (e.g., when your Data Source contains both .csv/.xlsx files and other file types, or if you chose the Both option for your structured files), you can drag and drop the Data Source step twice onto the canvas. Configure one copy to use Semantic retrieval and the other to use SQL retrieval, then connect both to the LLM.
Text-to-SQL Agent Settings
Model Selection
You need to select the LLM that will be used in the agentic workflow for Text-to-SQL. The LLM is fully responsible for SQL query generation. We recommend using “High Quality Capable” models to achieve stable and accurate results. Recommended models (tested):
- High Quality (best performance):
  - Claude 4 Sonnet
  - GPT 4.1
  - Claude 3.7 Sonnet
  - GPT 4o
- Sufficient Quality:
  - GPT 4.1 mini
  - Claude 3.5 Sonnet
  - GPT 4o mini
Fuzzy Search
You can enable Fuzzy search to allow the system to search through records even if there are misspellings in the user’s query. Note that Fuzzy search can increase query generation complexity.
When the Agent runs with the configured Data Source step, it produces results based on the chosen settings. The Text-to-SQL retrieval agentic flow outputs a structured result from the dynamically generated SQL query, based on the user’s natural language input.
The choice between semantic retrieval and SQL retrieval for agents depends on the query type, data structure, scalability needs, and maintenance considerations. For structured files like .csv and .xlsx with precise, structured queries, SQL retrieval is preferred for its efficiency, accuracy, and ability to answer quantitative questions. For natural language queries or text fields that require semantic understanding, semantic retrieval is the better fit. In practice, combining both methods often provides the most flexible and effective solution, especially for agents interacting with users through natural language.
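As a rough illustration of the fuzzy matching idea behind the Fuzzy search setting, the sketch below maps a misspelled value from a user's question onto the closest stored value before it would be used in a query filter. It uses Python's standard `difflib` module and is not a description of how Airia implements Fuzzy search. The `fuzzy_lookup` helper and the sample values are hypothetical.

```python
import difflib


def fuzzy_lookup(user_value, known_values, cutoff=0.6):
    """Return the stored value closest to user_value, or None if nothing is close."""
    by_lower = {value.lower(): value for value in known_values}
    matches = difflib.get_close_matches(
        user_value.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


products = ["Agent Builder", "Data Source", "Token Feed"]

print(fuzzy_lookup("Agnet Bulider", products))  # misspelled product name
print(fuzzy_lookup("zzzz", products))           # nothing similar enough
```

Raising `cutoff` makes matching stricter, which reduces false matches at the cost of tolerating fewer typos.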
Configure Vector Database
The chosen Vector Database significantly impacts search capabilities, especially regarding hybrid search.
Available Options
- Airia DB: This is the default vector database option. The proprietary database supports Hybrid search by default. If Hybrid search is turned off, only dense vectors are generated and only semantic search is available for the data source.
- Pinecone BYOK (Bring Your Own Key): Hybrid Search is available if the index you provide in your Pinecone database supports it. If the index is configured for both dense and sparse vectors, Airia will, by default, generate both sparse and dense vectors in your Pinecone database to enable this capability. Requires a Pinecone index name and API key.
- Weaviate BYOK (Bring Your Own Key): Hybrid Search is always available with Weaviate. Weaviate applies Fusion algorithms for ranking results from both keyword (lexical) and semantic searches, enhancing relevance. You can learn more about fusion algorithms in the Weaviate blog. Requires a Weaviate endpoint and API key.
- Azure AI BYOK (Bring Your Own Key): Hybrid Search is always enabled by default. Azure AI does not support Fusion algorithms for ranking results. Requires an Azure AI endpoint and API key.
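To give a sense of what a fusion algorithm does in hybrid search, the sketch below implements reciprocal rank fusion (RRF), a widely used technique for merging a keyword ranking and a semantic ranking into one list. It is a generic illustration, not the exact algorithm used by Weaviate or Airia. The chunk IDs are made up, and `k=60` is a commonly used smoothing constant rather than a documented setting.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list.

    Each item scores 1 / (k + rank) per list it appears in, so items ranked
    highly by more than one search method rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["c2", "c1", "c4"]   # lexical (sparse vector) results
semantic_ranking = ["c1", "c3", "c2"]  # dense vector results

print(reciprocal_rank_fusion([keyword_ranking, semantic_ranking]))
```

Chunks `c1` and `c2` appear in both rankings, so they outrank `c3` and `c4`, which each appear in only one.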
